ON A NEW CORRELATION COEFFICIENT, ITS 
ORTHOGONAL DECOMPOSITION AND ASSOCIATED TESTS 

OF INDEPENDENCE 

O ■ By Wicher Bergsma^ 

O, 

^SJ , London School of Economics and Political Science 

j;^ ^ ^ A possible drawback of the ordinary correlation coefficient p for two 

♦"p, real random variables X and Y is that zero correlation does not imply 

■^r ' independence. In this paper we introduce a new correlation coefficient p* 

which assumes values between zero and one, equalling zero iff the two vari- 
ables are independent and equalling one iff the two variables are linearly 
related. The coefficients p* and p^ are shown to be closely related alge- 
braically, and they coincide for distributions on a 2 X 2 contingency table. 
We derive an orthogonal decomposition of p* as a positively weighted sum 
of squared ordinary correlations between certain marginal eigenfunctions. 
Estimation of p* and its component correlations and their asymptotic 
r~| i distributions are discussed, and we develop visual tools for assessing the 

nature of a possible association in a bivariate data set. The paper includes 
consideration of grade (rank) versions of p* as well as the use of p* for 
contingency table analysis. As a special case a new generalization of the 
Cramer-von Mises test to K ordered samples is obtained. 



00 

H 

00 



a 



^ '. Contents: 

o 

f^ ' 1 Introduction 

2^ ■ 2 Properties of the kernel function hp 4 

\^ I 2.1 Key properties of hp 5 

(^ • 2.2 Spectral decomposition of hp 9 

2.3 Obtaining the eigensystem in the discrete case 13 

2.4 Obtaining the eigensystem in the continuous case 14 

a 2. 5 Discrete approximation of the continuous case 17 

2.6 Relation to Anderson-Darling kernel 18 

> 

• i-H . 3 Properties of k and p* 18 

^ ' 3.1 Key properties of k and p* 19 

3 . 3.2 Orthogonal decomposition 22 

3.3 Parameterization of the likelihood 24 

3.4 Frechet bounds for component correlations 24 






^Supported by The Netherlands Organization for Scientific Research (NWO), Project Number 
400-20-001. 

'^AMS 2000 subject classifications. Primary 62H20; secondary 62E99 

Key words and phrases, correlation, covariance, test of independence, spectral decomposition, 
eigenvalues, eigenfunctions, Hilbert-Schmidt operator, FYechet bounds, contingency tables, phi- 
square, canonical correlation, Cramer-von Mises tests, rank tests, Fredholm integral equation of 
the second kind. 



2 Wicher Bergsma 

4 Estimation and tests of independence 27 

4.1 U and V statistic estimators of k 27 

4.2 Permutation tests 28 

4.3 Asymptotic distribution of estimators under independence 29 

4.4 Bonferroni corrections for testing significance of component correla- 
tions 30 

5 Grade versions of k and p* , copulas, and rank tests 31 

5.1 Rank tests 32 

5.2 A new class of iiT-sample Cramer- von Mises tests as a special case . 32 

5.3 K as a weighted (/)-coefficient 33 

6 Data analysis: investigating the nature of the association 34 

6.1 Some artificial data sets 35 

6.2 Mental health data 41 

6.3 Norwegian stock exchange 43 

6.4 Discussion 43 

1. Introduction We introduce a correlation coefficient p* which has the po- 
tential advantage compared to the ordinary correlation p that it detects arbitrary 
forms of association between two real random variables X and Y . In fact p* , to be 
defined below, can be viewed as a simple modification of p^ , as we now show. The 
ordinary covariance is defined as 

cov(X, Y) = E{X - EX){Y - EY) 

and the ordinary correlation as 

cov(X, Y) 

v/cov(X, X)coy{Y, Y) 

Now suppose that Z , Z\ and Zi are iid with distribution function F ^ that X and 
Y have marginal distributions F\ and F2, and that [X\^Y\)^ and (Xi^Y^) are iid 
replications of (X, Y) . Then with 

(1.1) uf{z^,Z2) - (zi - EZ){z2 - EZ) = E{zi - Zi){z2 - Z^) 
it is easy to verify that 

(1.2) C0v(X,r)2 = Suf,(Xi,X2)uF.(n,1^2) 

Now straightforward algebra based on the left hand side of (1.1) shows that we can 
rewrite up as 

uf(zi, Z2) = -\e{ |zi -Z2f - W -Z2f- \Zi - Z2|2 + |Zi - Zai" ) 
Replacing the squares in up by absolute values then gives 

(1.3) hF{zu Z2) = -hE{ \zi - Z2I - |zi - Z2I - |Zi - Z2I + \Zi - Z2I ) 
and we can define a new 'covariance' k by replacing u by h in (1.2): 

>i{X,Y) = EhFAXi,X2)hFAYuY2) 
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Now we can also define 



p*{X,Y)= ^ ' ' 



^k(X,X)k{Y,Y) 

Thus, whereas the squared covariancc and p^ are based on squared differences, k 
and p* are based on absolute differences. In this paper we demonstrate the perhaps 
surprising result that < p*{X,Y) < 1, such that p*{X,Y) = iff X and F 
are independent, and p*{X,Y) = 1 iff X and Y are linearly related. A further 
main result we give is an orthogonal decomposition of p* in terms of component 
correlations between eigenfunctions of /ifi and hF2 ■ 

Based on their formulas, the following statistical interpretation of p^ and p* can 
be given: they measure how much two X observations which are 'far' apart tend 
to occur with Y observations which are 'far' apart, and similarly how much two X 
observations which are 'close' together tend to occur with Y observations which are 
'close' together. 

This paper is organized as follows. In Section 2, the properties of the kernel 
function hp are investigated in detail. Some general properties are given, including 
conditions for its existence and a proof that it is positive, and a large part of the 
section is devoted to the spectral decomposition of hp- We show that if hp is 
square integrable, it has a mean square convergent spectral decomposition in terms 
of the eigenvalues and vectors of hp- For discrete F, a set of difference equations is 
given which has this eigensystem as its solution, and for continuous F an analogous 
differential equation is given. The numerical solution of these equations is treated 
in some detail. Closed form solutions are only available in some special cases, for 
example, if F belongs to the uniform distribution on [0, 1], the eigenfunctions of hp 
are the Fourier cosine functions. 

The results of Section 2 are used in Section 3 to derive properties of p* . We 
demonstrate the aforementioned result that < p* {X, F) < 1 , such that p* {X, Y) = 
iff X and Y are independent, and p*(X, Y) = 1 iS X and Y are linearly related. 
Furthermore, a decomposition of p* is given in terms of a sum of squared correlations 
between marginal eigenfunctions of hp-^ and hp^ weighted with the product of the 
corresponding marginal eigenvalues. We give a parameterization of the likelihood in 
terms of component correlations of p* and the marginal eigenfunctions, somewhat 
analogous to the well-known canonical correlation decomposition. Frcchct bounds 
for the component correlations are discussed, which gives some insight into the 
possible structure of the dependence between two random variables. Finally in this 
section, component correlations for the normal distribution are discussed as an 
illustration. 

In Section 4, we derive sample and unbiased estimators of k(X, Y) and related 
estimates of p*{X, Y), which can be calculated in time 0{n^). The asymptotic dis- 
tributions of the estimators under independence, which is a mixture of chi-squares, 
is derived. Finally, small sample permutation tests and Bonferroni corrections for 
testing the significance of component correlations arc discussed. 

Section 5 concerns grade versions of k and p* , copulas, and rank tests. Rank 
statistics, obtained from the grade versions of k and p* , are discussed. It is shown 
that the two sample Cramcr-von Mises statistic is obtained as a special case, as well 



4 Wicher Bergsma 

as a new generalization to the case of K ordered samples. Furthermore, it is shown 
that p* is a weighted mean of phi-square coefficients obtained from collapsing the 
distribution onto a 2 x 2 table with respect to cut-points {x,y). 

In Section 6 we propose a methodology for gaining an understanding of the 
association between two variables from a data set. The methodology is based on 
combining hypothesis tests with visual tools for displaying how much individual 
observations contribute to the association. 

Although many of the results of the present paper are new, we have, naturally, 
also borrowed much from the literature, particularly concerning the eigensystems 
and orthogonal decompositions. Some important references here are Anderson and 
Darhng (1952), Durbin and Knott (1972), De Wet and Venter (1973) and De Wet 
(1987), among others. However, the focus of much of the literature we refer to is on 
studying power of hypothesis tests. The aim of this paper, on the other hand, is on 
providing a meaningful coefficient for describing association, which we hope leads to 
a useful methodology for gaining an understanding of the association between two 
variables, and, along the way, to tests with high power against salient alternatives, 
the salience of the alternatives being determined by the size of p* . 

Throughout this paper, we use the following conventions and assumptions. We 
assume that {X,Y), {Xi,Yi) and (^2,12) are iid with marginal distribution func- 
tions Fi and F2, respectively, and joint distribution function F12. We impose no 
restrictions on the distributions, i.e., they may be continuous, discrete, or mixed 
continuous-discrete. For simplification of some of the derivations, we define distri- 
bution functions in the following slightly non-standard way: 

Fi{x) = P{X <x) + iP{X = x) 

F2{y) ^ P{Y < y) + iP{Y ^ y) 

and 

Fi2(a;, y) = P{X<x,Y <y) + iP{X < x,Y ^ y) + 

\P{X^x,Y<y) + \P{X^x,Y^y) 

2. Properties of the kernel function hp In this section a detailed descrip- 
tion is given of the kernel function hp defined by (1.3). Section 2.1 concerns ex- 
istence, continuity, positivity, square integrability, existence of the trace and the 
shape of the graph of hp- Methods for verifying whether several of these properties 
hold are given. In Section 2.2, under the assumption of square integrability of hp, its 
spectral decomposition is given, and some properties of the associated eigenvalues 
and functions are derived. The eigensystem is the solution to an integral equation 
which may be difficult to solve. The problem is reformulated in terms of difference 
equations for the discrete case in Section 2.3 and in terms of a differential equation 
for the continuous case in Section 2.4. Both rewrites appear much easier to solve 
than the integral equation. In Section 2.5, efficient numerical approximation of the 
eigensystem of hp for continuous F is discussed. For several well-known distribu- 
tions, including the uniform and the normal, closed form solutions or numerical 
approximations of (parts of) the eigensystem are given. A new distribution F is 
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introduced which has the seemingly rare property that hp is square integrable but 
has infinite trace. In Section 2.6 the relation between hp and a kernel introduced by 
Anderson and Darling (1952) is given. We arc not aware of the kernel hp, depending 
on F, having been described previously. 

2.1. Key properties of hp The kernel hp exists if hp{zi,Z2) is finite for some 
(zi, Z2) G R . The kernel hp is positive if 

(2.4) Eg{Z^)g{Z2)hp{Z^,Z2)>Q 

for every function g : R ^ R for which the expectation exists. The kernel hp is 
square integrable if 

(2.5) EhpiZr, Z^f - I hp{zi, Z2fdF{zi)dF{z2) 
is finite. The kernel hp is trace class if its trace 

tT{hp) = EhpiZ, Z)= I hpiz, z)dF{z) 

is finite. In several lemmas below we give some relatively easily verifiable conditions 
for checking whether these properties hold for hp. The final Lemma 4 concerns the 
shape of the graph of hp. 

The next lemma may simplify verification of the existence of hp, and asserts 
continuity and positivity as well as giving another integral representation of hp. 
First we need the following notation: 

{0 X > y 
i a; = y 
I X <y 

Note that 

(2.6) F{z)^E-i{Z,z) 

(2.7) l-F{z) = E-i{z,Z) 

Lemma 1 . // hp exists it exists on R^ . It is then continuous and positive, with 
equality in (2.4) only for the constant function, and has the representation 

/•OO 

hp{zi,Z2)= / ['y{zi,w) - F{w)]['y{z2,w) - F{w)]dw Vzi,Z2 



Proof. Wc first show continuity of hp on its domain. Let 5 > 0. Then if 

|zi — zJI < S and \z2 — Z2\ < 6, 

\hp{z[,Z2) - hp{zi,Z2)\ 
= k\E [i\z[ - 41 - kl - ^2|) - (14 - Z2\ |Z1 - Z2\) {\Z'2 -Z,\-\Z2- Z^\)] \ 

<^E[26 + S + S] 
= 2(5 
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Hence, hp is continuous and bounded on any finite domain. Therefore, if hp exists 
in one point it exists on R^. 

We next derive the integral representation of hp. We have 

/oo 
[7(21, w) --f{z2,w)]'^dw 
-00 

and with 2^:4 the ith largest number in the set {zi, Z2, Z3, 2:4}, we have 
|[7(zi, w) - j{z3,w)][j{z2,w) - 7(24, w)]| dw = 

if 2:1,23 < Z2, Z4 or 21, 2:3 > Z2, 24 



^ ^ 23:4 — 22:4 othcrwise 

Now we can derive the desired result first using (2.8), then applying Fubini's theo- 
rem which is justified because (2.9) is finite, and finally using (2.6): 

/oo 
([7(21, w) -7(22,^)]^ - [7(21, w) -7(^2,^)]^- 
-00 

b(Zi,w) -7(22,w)]^ + [j(Zi,w) -j(Z2,w)]^)dw 

/•CO 

^E [-f{zi,w) - -f{Zi,w)][-f{z2,w) - j{Z2,w)]dw 



(2.10) =/ [-f{zi,w)~F{w)Mz2,w)-F{w)]dw 

J —00 

Finally, we show positivity of hp. Let g be nonconstant. Then using (2.10) and 
Fubini's theorem, 

Eg{Zx)g{Z2)hp{Zx,Z2) 
= E I g{Zi)Yi{Z^,w)-F{w)]g{Z2)[i{Z2,w)-F{w)]dw 

-00 

{Eg{Z)[-i{Z, w) - F{w)]f dw>0 

Hence, hp is positive. If g is constant it is easily verified that the expression is zero. 

D 

Note that from the lemma, it follows that for checking existence of hp, it suffices 
to check existence of hp{0, 0). Now hp{0, 0) has the following convenient represen- 
tations 

hpiO,0) = ^E{\Zi\ + IZ2I - \Zi - Z2\) 

/O I'oo 

F{zfdz + [1 - F{z)fdz 

-00 Jo 

These representations arc immediately verified from (1.3) and from the representa- 
tion of hp given in Lemma 1. 
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An example of a random variable for which hp does not exist is Z = V^ , where 
V has a Cauchy distribution. This can be verified by checking that (2.11) does not 
converge. 

By giving some alternative representations of (2.5), the next lemma may be 
helpful in the verification of square integrability of hp ■ 

Lemma 2. We have: 
{2.l2)EhF{Zi, Z2f = \e{Z2-a - Z^.aY = 2 [ F(zi)2[l - Fiz2f]dzidz2 

Proof. To prove the first equality, write a^- ~ \Zi — Zj\. Then, since for 
example Eai2 ~ Eai^ and _Ea34ai2 — £^034015, we obtain 

EhF{Z\,Z2Y = -7E {ai2 - ai3 - 024 + 034) (ai2 - ai5 - a26 + ase) 
1 , 2 

— T-E (aj2 ~' Ol2ai5 — 012026 + 012056 

— 013012 + 013015 + 013026 " 013056 

— O24O12 + O24O15 + O24O26 — O24O56 
+O34O12 — O34O15 — O34O26 + 034055) 

= 7^ ('^12 - 012013 - O12O24 + O12O34) 

f ^ ^2 

= T7-^ ("12 ^ «13 - 024 + 034) 

fo 
It may now be verified that 012 — 013 — 024 + 034 equals ii Zi, Z4 < Z2,Z3, 
or Zi^Zi > ^2,^3, and equals ±2{Z2-a — ■2^3:4) otherwise. Hence, and because 
-P(oi2 — 013 — 024 + 034 7^ 0) = 2/3, we have 

EhpiZi, Z2Y = 77: (012 - 013 - 024 + 034) 
Id 

= ^ X ^ X 4^(^2:4 - Z^A? = \e{Z2A - Z^A? 

which is the first part of the lemma. 

To prove the second equality, first note that 

(2.13) E-i[Z, z{)-f{Z, Z2) = min{F(zi), F{z2)} 

We now have using Lemma I, Fubini's theorem (for justification see proof of 
Lemma I) and (2.13), 



EhpiZi, Z2Y = E \ Yi{Z^,z{) - f (zi)][7(Z2, zi) - £(zi)]dzi x 

[7(Zi, Z2) - F{z2)\Yi{Z2,Z2) - F{z2)]dz2 

E[-f{Z,,zi) - Fizi)MZi,Z2) - Fiz2)] X 
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E[j{Z2, Zi) - F{zi)][j{Z2, Z2) - F{z2)]dzidz2 
[min{F(zi), F{Z2)} - F{zi)F{z2)fdzidz2 

2/ Fizif[l-Fiz2)fdzidz2 

J Zi<Z2 



D 



An example of a distribution for which hp exists but is not square integrablc is the 
Cauchy distribution. In particular, hpiO, 0) = 2tt^^ log 2, so hp exists. In this case, 
nonexistence of EhpiZi, .^2)^ can most easily be verified using the right hand side 
of (2.12) and a computer algebra package such as provided in Mathematica. 
The next lemma may be helpful in verifying whether or not hp is trace class: 



Lemma 3. We h 



ave 



tT{hF) = iE\Zi - Z2I = [ Fiz)[l - F{z)]dz 
which is finite iff Z has finite mean. 

Proof. The first equality follows directly from (1.3), the second is well-known 
and can be found using similar techniques as in the proof of the second equality in 
Lemma 2. As is well-known, integration by parts leads to the representation of the 
mean as 

EZ = [1 - F{z)]dz - / F{z)dz 

Jo J -00 

so the mean exists iff the terms on the right hand side exists. Now since 

/•oo /"OO /"OO 

F{Q) j [1-F{z)]dz< F{z)[l - F{z)]dz < [1 - F{z)]dz 

Jo Jo Jo 

and 

/O rO M 

[1-F{z)]dz< F{z)[l-F{z)]dz< I F{z)dz 

-00 J —00 J —00 

it follows that 

/•oo pO 

ti{hF)=l F{z)[l- F{z)]dz+ F{z)[l- F{z)]dz 

Jo J-00 

exists iff EZ exists. D 

The quantity E\Zi — Z2I is also called Gini's mean difference. Note that, by Lem- 
mas 2 and 3, both EhpiZi, Z2Y and iv{hF) can be used as measures of dispersion. 

An example of a distribution function F for which hp is square integrablc but 
not trace class is given in Example 2 in the next subsection. 

We conclude this section with a lemma concerning the shape of the graph of hp. 
In Figure 1 a representation of the graph of hp with F the CDF of the normal 
distribution is given. The statements of Lemma 4 can be verified in the plot. Then: 
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Lemma 4. Suppose F is such that hp exists. Then: 

1. For given Zi, hp{zi,Z2) is strictly decreasing in Z2 on the domain {z : z > 
zi,F{z) < 1} and strictly increasing in zi on the domain {z : z < zi,F{z) > 
0}. 

2. hF{z,z) is strictly increasing in z on the domain {z : F{z) > i} and strictly 
decreasing in z on the domain {z : F{z) < i}. 

Proof. Part 1: With zi, 22,23 such that zi < Z2 < zz and F{zi) < 1, we 
obtain using Lemma 1 that 

hpizi^zs) - hF{zi,Z2) == / [7(21, w) - F{w)][y{z:i,w) ~ -i{z2,w)]dw 

2:3 
[1 - F{w)]dw 

<0 

where the strict inequahty holds because -^(23) < 1. This proves the strict decreas- 
ingness part, the strict increasingness is proven by appropriately reversing signs in 
the above. 

Part 2: For z < z' we have 

1*00 
hF{z,z) — hp{z' , z) — / [j{z,w) ~ F{w)] ~ [^{z^w) — F{w)] dw 



[1 - 2F{w)]dw 

which is positive if F{z') > F{z) > i and negative if F{z) < F{z') < \, proving 
the monotonicity relations. □ 



2.2. Spectral decomposition of hp A sequence of random variables Z^^' is said 
to converge to Z in mean square if 



For a distribution function F, we define 



L2{F)^ .g:R^R 



g{z)'^dF{x) < 00 

as the set of square integrable functions with respect to F. A set of functions {gk} 
is said to be orthonormal if 

(2.14) Egk{Z)gi{Z) ^ 5u 

(with 5 the Kroncckcr delta) and 

EgkiZf = 1 
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Fig. 1. Graph of hp with F the CDF for the standard normal distribution 



An orthonormal system {gk} is complete if for any function g e L2{F) there exist 
numbers {ak\ such that 

oo 

9{Z) ^^akgk{Z) 

fc=i 

with convergence in mean square. Then the system {gk} is also called a basis of 
L2{F). A number A is called an eigenvalue of hp with corresponding eigenvector g 
if Eg{Zf = 1 and 



(2.15) 

Then we have: 



Xg{z)^EhF{z,Z)g{Z) 



Theorem 1. Suppose hp is square integrable. Then there exists a complete 
system of orthonormal functions {gk} of L2{F) consisting of eigenf unctions ofhp, 
the corresponding eigenvalues {Xk} being nonnegative. Each gk is continuous and 
satisfies 

(2.16) EgkiZ) = 

if Xk 7^ 0. Furthermore, hp has the spectral decomposition 



(2.17) 



hp{Zi, Z2) = ^ Xkgk{Zi)gk{Z2) 



k=l 



where convergence is in mean square. 



Proof. A continuous square integrable kernel on a sigma-finite measure 
space is a Hilbert-Schmidt kernel which by a generalization of Mercer's theorem 
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has the desired spectral decomposition (Zaanen, 1960). It is easy to verify that 
(R, B, F) is a cr-finite measure space, so the operator mapping the function g to 
^ hp{x,y)g(y)dF{y) is a Hilbert-Schmidt operator. Nonnegativity of the eigen- 
values follows from positivity of hp (see Lemma 1). Furthermore, from (2.15), 
XEgk{Z) = so Egk{Z) — for nonzero Afc. □ 

Note that if 5 is a solution to (2.15), then so is —g. To identify the solutions, 
we proceed as follows. A function g is initially positive if there exists a z such 
that g{z) > and g{z') < for all z' < z. Without loss of generality, we may 
assume that the gk are initially positive. We also assume the eigenvalues are ordered: 
Ai > A2 > . . .. An interesting property of eigcnfunctions of homogeneous positive 
Fredholm integral equations of the second kind is that they are oscillating, in the 
sense that for every k, gk has k distinct zeroes and no more than k—l local extremes. 

Some further results are as follows: 

Lemma 5. For square integrable hp: 

1. if finite, tr{hp) = J2T=o ^k 

2. Ehp{Z,,Z^f = Y.Z,M 

Proof. From the spectral decomposition (2.17) and Fubini's theorem, 
Ehp{Z, Z) = Ej^^kghiZf = ^ Afe 
For the second part, the desired result is obtained by Parseval's theorem. D 

As an example, we consider the dichotomous case which is the simplest possible 
and has closed form solutions: 

Example 1. Consider the dichotomous case thatP{Z = 0) = 1—P{Z = 1) = p. 
Then 

hFiO,0) = il-pf 

hFiOA) = -p{i-p) 

hF{l,0)^-pil-p) 
hFil,l)=p^ 

The eigenvalues are (Ao,Ai) = (0,p(l — p)) and the eigcnfunctions are go{z) = 1 
and 

5i(z) = ^=L=[z-(l-p)] 
which has mean zero and variance one. Now hp{zi,Z2) — Xigi{zi)gi{z2)- 
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Solutions to (2.15) for various other F are given in Table 1, see Section 2.4 for an 
explanation. 

Equation (2.15) is a homogeneous Fredholm integral equation of the second kind 
based on the degenerate kernel hp (see, for example, Tricomi, 1985). In general 
these equations are difhcult to solve. In Sections 2.3 and 2.4, (2.15) is reduced to 
a simpler problem for the discrete and continuous case, respectively. We conclude 
this section with an alternative formula to (2.15) for finding the eigensystem which 
we use in the next two subsections. With 



G{z) = / giy)dFiy), 

■.Z — oo 

we have: 

Lemma 6. The nonzero eigenvalues and eigenvectors ofhp are solutions to the 
equation 



A [g{z) - g{z')] + / G{w)dw = 

J z 

for all z < z' . 

Proof. Substitution of the expression for hp of Lemma 1 into (2.15) yields 
A.g(^)-/ / b(z,w)~F{w)Mz2,w)~F{w)]g{z2)dF{z2)dw 



— oo J — oo 



[7(2, w) — F{w)]G{w)dw 



F{w)G{w)dw + [1 - F{w)]G{w)dw 

) J z 

The Lemma follows from this and since by (2.16) G{z) ^ as \z\ -^00. D 

We have the following interesting implication: 

Corollary 1. If {a,b) is an interval with zero probability mass, i.e., F{a) — 
F{b), then a solution g{z) to (2.15) is linear on {a,b). 

Proof. If F(a) ~ F{b) then from its definition it follows that G is constant 
on (a, b). Therefore, for any z, z' G (a, 5), we obtain from Lemma 6 that 

A [g{z) - g{z')] = c(z - z') 

for some constant c. Hence, g is linear on (a, b). □ 

Note that for discrete distributions, it follows that the eigcnfunctions g are piecewise 
linear. 
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2.3. Obtaining the eigensystem in the discrete case We consider the case that 
Z is a.s. in a finite set, say P{Z € {zi, Z2, ■ ■ ■ , zk}) — 1- We use the shorthand 
P[Z ~ Zi) ^ Pi and assume without loss of generaUty that pi > and Zi < z^+i for 
all i. Then we obtain: 

Lemma 7. With Ci ~ [zi — Zi^\)^'^ the nonzero eigenvalues and eigenvectors 
of hp are solutions to the equations 

Pig{zi) = \c2[g{zi) - .9(22)] 

Pk9{zk) = >^CK[g{zK-i) - g{zK)] 

Pig{zi) = A [agizi^i) - (q + ^+1)3(2;^) + Ci+ig{zi+i)] 2<i< K -1 

Proof. First note that from the definition for z E {zi, z^+i), 

i 

G{z) ^^pjg{zj) 
i=i 
It follows that 



G{w)dw = (zi+i - Zi)22Pj9{zj) 
0=1 



The first two displayed equations of the lemma now follow from (2.18), Lemma 6 
and from '^Pig{zi) — 0. From (2.18) we further obtain 

/•zi+i rzi 

(2.19) Ci+i / G{w)dw - Ci I G{w)dw ^ pig{zi) 



Again from Lemma 6, we have 



Zi + l 



^ [gizi) ~ g{zi+i)\ + / G{w)dw = 

J Zi 

A [g{zi^i) - g{zi)] + / G{w)dw ^ 

Multiplying these equations by Ci+i and Ci respectively, taking differences and the 
use of 2.19 yields 

Pigk{zi) = A [Ci5fe(zi_i) - {ci + Ci+i)gk{zi) + Ci+igk{zi+i)] 

which completes the proof. □ 

More details on difference equations of the form given in Lemma 7 can be found 
in Agarwal (1992). Lemma 7 allows fast and memory efficient computation of the 
eigenvalues and vectors. In matrix notation, we must solve the generalized eigen- 
value problem 

DpQ = \Gg 



14 
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where g is the eigenvector with corresponding eigenvalue A, Dp is a diagonal matrix 
with the coordinates of p on the main diagonal, and 



C 



/C2 

C2 






V 



C2 



-(C2 +C3) 
C3 






C3 

-(C3 + C4) 
C4 






C4 

-(C4 +C5) 



\ 






CKJ 



CAT 

Cat/ 



This method can also be used for fast approximation of continuous systems (see 
Section 2.5). The exact solution for continuous systems is given in the next subsec- 
tion. 

2.4. Obtaining the eigensystem in the continuous case If F is differentiable, 
the problem of finding the eigenvalues and vectors can be reduced to a differential 
equation which is sometimes easier to solve: 



Lemma 8. Suppose f is the derivative of F. Then the nonzero eigenvalues and 
eigenvectors of hp are solutions to the equation 



(2.20) 

subject to the side condition 



Xg"{z) + f{z)g{z) = 



/(z) ^0 as |z| ^ 00 



Proof. Letting z' 



z in Lemma 6 we obtain 
Xg\z) + G{z) ^0 



Differentiating both sides with respect to z yields (2.20). Finally, since by (2.16) 
G{z) -^ as \z\ -^ 00, the side condition follows. □ 

Equation (2.20) leads to an interesting observation on the behavior of the eigen- 
functions gk' the second derivative g'l{x) is proportional to the local density f{x) 
times gk{x). As mentioned in Section 2.2 and as follows from the theory of Sturm- 
Liouville differential equations, the eigenfunctions oscillate. Now if the local den- 
sity is low, the oscillatory behavior will be slower. In particular, on intervals where 
the local density is zero the second derivative of an eigenfunction is zero so the 
eigenfunction is linear. (NB: this also holds for non-continuous distributions, see 
Corollary 1). 

If F is both differentiable and invertible, we can reformulate differential equa- 
tion (2.20) in the standard Sturm-Liouville form: 
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Lemma 9. Suppose F is invertible. Let q{u) — g[F^^{u)] and suppose q is twice 
differentiable. Then the eigenvalues and eigenvectors are solutions to the equation 

(2.21) A.f[F-\u)W{u) + \-^q{u) = 

subject to the side condition 

f[F-^{u)]q'{u) -^OasulOoru'il 



Proof. Note that 



du^ ^"^ f[F-^u)] 



so that 



f[F-\u)]q'{u)=f[F-\u)]^g[F-Hu)]^g'[F-Hu)] 

au 



From this the side condition follows. Substituting z = F ^(u) into the left hand 
side of (2.20) yields 

Xg"[F-\u)]-f{F-\uMF-\u)] 

= \f[F-Hu)] ^ f[F-^(u)]q'(u) - f[F-HuMu) 
Hence dividing both sides of (2.20) by Xf[F^^{u)] yields the desired result. □ 

Note that for the qk the orthonormality condition (2.14) reduces to 

qk{u)qi{u)du == 4; 



In general, the differential equations (2.20) and (2.21) do not have closed form 
solutions (i.e., solutions in terms of well-known functions). We obtained the solu- 
tions for the uniform, the logistic and the exponential distributions which are given 
in Table 1, where Pk is the fcth Legendre polynomial, Ji is the ith Besscl function 
of the first kind, a^ the fcth zero of Ji and 

Numerical solutions were obtained for the normal, Laplace and chi-square distri- 
bution with one degree of freedom, see the next subsection for details on obtaining 
these solutions. For the normal and Laplace distributions we obtained the exact 
value of ^ Afc and for the normal distribution of ^ A^ using Lemmas 2 and 3 
combined with results for order statistics by Bose and Gupta (1959) and Govin- 
darajulu (1963) summarized in Johnson, Kotz, and Balakrishnan (1994) and using 
Mathematica. 

We also obtained a closed form expression for the cigcnsystem for the distribution 
introduced in the next example. It is an example of a distribution F for which hp is 
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square integrable but not trace class, i.e., by Lemma 5, ^ A| is finite but "^ \k = oo. 
We are not aware of any previous studies of this distribution. 

Example 2. With V standard normally distributed, let Z be defined as the 
following function of V : 

Z=\J\j cxp{ty2)dt 
where the convention is used that for a > b, 

f{t)dt = - / f{t)dt 

Jb 

Close to zero, Z has approximately a normal density, but for large values the density 
is much lower, that is, Z has much heavier tails than a standard normal. 

With F the distribution function of Z , we now show that hp is square integrable 
hut not trace class. To show this, we shall derive an expression for f[F~^{u)] to be 
plugged into Equation (2.21) so that it can be solved. The derivation involves the 
so-called complex error function. The error function is defined as 



erf(z) == \/27r / exp{-t'^)dt 
Jo 



Note that the CDF of the standard normal distribution is <i>(w) — (1 + crf(w/v2))/2. 
The imaginary error function is defined as 

erfi(z) = — icrf(«z) 

(Here i = \/—l. See Weisstein, 1999, for some of the properties ofeii and erfi.j We 
define the inverse erfi~ (z) as the unique real y satisfying z = erfi(j/). 
We see that 

Z = TT erfi —^ 

Now .since V has a .standard normal distribution we obtain for the CDF of Z : 

F{z) = P{Z <z) = p{y < ^/2erfi"^(z/7^)') = $ (\/2erfi"^(z/7r)'j 
From this, 

F-'^{u) = TTCrfi (^-^{u)/V2) 
Some tedious but straightforward algebra then gives 

f{F-\u)) = l/c^[^-\u)r 
Plugging this expression into (2.21) leads to the solution 

Afe = 1/fc 



qk{u) = Hk [^-Hu)] VW^iu)] 

where Hk is the kth Hermite polynomial. The '= ' symbol is used to indicate that 
the expression for qk needs to be suitably normalized. This solution of (2.21) was 
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Distribution 

Afe 
EAfc 

Ai/EAfe 
A2/EAfc 
As/EAfe 
A4/EAfc 

Distribution 



Logistic 



Uniform 



Normal 



1 

1 _ 73-1 



0.5269 
0.1635 
0.0795 
0.0470 



u{l — u) 
1 

k{k+l) 
1 

V2k+lPk{2u-l) 

0.5000 

0.1667 

0.0833 

0.0500 

Exponential 



1 

1 

1 

6 

J_ 

90 

2cos(/i;7ru) 

0.6079 

0.1520 

0.0675 

0.0380 

Laplace 



Chi-square Example 2 



yf[F-Hu)] 

Afe 

EAfc 
EA| 

qk{u) 

Ai/EAfe 
A2/EAfc 
As/ EAfc 
A4/EAfc 



1-u 



min('u, 1 — u) 



0[$-i(^)] 



1 

1 

12 



3 

0.1458 



PkJo{(^kV^ 

0.5453 

0.1627 

0.0774 

0.0451 



u) 



0.4611 
0.1816 
0.0875 
0.0542 
Table 1 

Eigenvalues and eigenvectors of kernel function hp for various 

"' No closed form expression is available. 

" Expression not normalized. 



a 


1 

k 


0.6360 


oo 


0.1399 


6 


a 


Hk 


0.5567 





0.1615 





0.0758 





0.0438 






\u)r 



[$-!(«)] V0[*-H«)] 



derived by De Wet and Venter (1973), who provided a method for solving differential 
equations of the form -^w{u)q'f.{u) + X'j^ qk{u) — Q for certain types of weights w{u), 
including w(u) — l/(j)[<^~^(u)Y . Note that for gk we obtain 

gk{x) = Hk V2 ctH^^ (t: z) 



V2CTfL^\TTz) 

It is well-known that here E Afc is divergent and E A| = 7r^/6. 



The differential equation (2.21) with f[F~^{z)] replaced by a weight function 
w{z) is given in Anderson and Darling (1952), see also De Wet and Venter (1973) and 
De Wet (1987). We are not aware of equation (2.20) having been studied previously. 

2.5. Discrete approximation of the continuous case For many continuous dis- 
tribution functions F the differential equations (2.20) and (2.21) do not have a 
closed form solution and the first t eigenvalues and eigenfunctions can be approxi- 
mated by using a discrete approximation of F and solving the difference equations 
of Lemma 7. For i ~ 1, ... ,t, let Ci = F^^ { ^—-J— ) and let Z^*) be a discrete ran- 
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A 



10 
"^100 



A 



1000 



True value 

1 

0.28987 

0.50000 

9.0909x10-3 

9.9010x10-5 
9.9900x10-^ 



Estimate {t = 

0.99303 

0.29027 

0.50370 

9.3093x10-3 

9.9708x10-5 



101) 



Estimate (t = 

0.99931 

0.28988 

0.50035 

9.1056x10-3 

9.9145x10-5 

9.9970x10-'^ 



1001) 



Table 2 
Eigenvalues and their estimates based on discrete approximations for the kernel hp with F the 
logistic distribution function. 



dom variable with P{Z'-'^^ ~ q) ~ Pi — F{i/t) — F{{i — l)/t). The eigenvalues and 
eigenvectors of hp(t) , with F^*' the distribution function of Z^*\ can then be calcu- 
lated using the method of Section 2.3. For large t, this method seems to give good 
approximations of the eigenvalues and eigenvectors of hp. An idea of the quality of 
the approximations can be gained from Table 2. The numerical results in Table 1 
were obtained using this method. For further details on discrete approximations of 
eigenvalues and vectors of kernels see Tricomi (1985). 

To obtain a good approximation, t should of course be chosen as large as possible. 
Using Mathematica 5.2 on a Pentium IV computer at 3.0MHz, using no special 
routines for calculating the eigensytem of tridiagonal matrices, calculation of a 
complete solution for t — 1000 took 29 seconds. We expect that using software with 
such special routines, it is possible to obtain solutions of the first few eigenvalues 
and eigenvectors much more quickly and for much larger t. 

For calculation of the eigensystem from a sample, see Section 4.1. 

2.6. Relation to Anderson-Darling kernel A related kernel was studied by An- 
derson and Darling (1952) and De Wet and Venter (1973), among others, in the 
context of Cramer-von Mises tests. With w a nonnegative weight function, they 
considered the kernel 



we obtain 



rw(u,v) ^ I ['~i{u,t) -t][^{v,t) -t]w{t)dt 
Jo 

The kernel r^, is closely related to the kernel hp: with w{t) = tip-iif)] ■, 

r^{u,v)=hp[F-\u),F-\v)] 

Note that this conversion does not work for discrete F. The eigensystem for r^ 
is given by the set of solutions to (2.21) with f[F~^{z)] replaced by the weight 
function w{z) (Anderson & Darling, 1952). 

Other closely related kernels have been given in the context of two-sample tests 
by Zech and Asian (2003) and Baringhaus and Franz (2004). (See Section 5.2 for 
the relation between two-sample tests and tests of independence.) 

3. Properties of k and p* We now apply the results of Section 2 in order 
to derive properties of k and p* . In Section 3.1, some key properties are derived. 
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including that < p*{X,Y) <1, with p*iX,Y) = iff X J_ F and p*iX,Y) = 1 
iff X and Y are hnearly related. In Section 3.2 we give a decomposition of k{X, Y) 
and p* {X, Y) as weighted sums of squared correlations between the marginal eigen- 
functions of hpi and hp^, weighted by functions of the eigenvalues. In Section 3.3, 
we describe a decomposition of the likelihood in terms of marginal eigenfunctions 
and component correlations of p*. In Section 3.4, Frechet bound for the component 
correlations are given, which gives some insight into their meaning. 

3.1. Key properties of k and p* Some key properties of k and of p* , are given 
in the following theorem: 

Theorem 2. Suppose X and Y and Z are real random variables for which the 
marginal kernels hpi and hp^ exist. Then: 

1. If a, h, c and d are constants, then K{aX + h,cY + rf) = acK{X,Y) and 
p*{aX + b,cY + d) = p*{X,Y). 

2. k(X, Y)>0 with equality iff X ± Y . 

3. If k{X,X) < oo and k{Y,Y) < oo then k{X,Y) < .yK{X,X)K{Y,Y) with 
equality iff X and Y are a.s. linearly related. 

4-. If both X and Y are dichotomous then k{X, Y) — cov(X, Y)'^ and p* {X, Y) — 
piX^Yf 

5. With Zi-n the ith order statistic in a sample of size n, k{Z, Z) — hE{Z2-A — 

The proof of the theorem is given at the end of this section. Note that k(X, X) ~ 
Ehp^ (Xi, ^2)^ and k{Y^ Y) — Ehp^ (Yi, 12)^ so the condition in Part 3 is equivalent 
to square integrability of the marginal kernels. From Theorem 2, Parts 2 and 3, we 
immediately have the following: 

Corollary 2. Suppose k{X , X) < 00 andK{Y,Y) < 00. ThenO < p*{X,Y) < 
1, with p*{X,Y) ^ iff X JL Y and p*{X,Y) = 1 iff X and Y are a.s. linearly 
related. 

We may compare Corollary 2 to the related well-known result for the ordinary 
correlation p: If var(X) < 00 and var(y) < 00 then < p^ < 1 with p^ = if 
X JL Y and p^ — I iS X and Y are a.s. linearly related. The important difference 
is that X JL y is equivalent to p* = but X JL Y only implies p = 0, not vice 
versa. By Part 5 of Theorem 2, k(Z, Z) can be used as measures of dispersion for a 
real random variable Z . Note the relation with the variance, which can be defined 

as£;(Zi:2-^2:2)^. 

Note that, even though p*{X,Y) = 1 iff X and Y are linearly related, p* is 
not a measure of linear association in the following sense: if the slope of the linear 
regression line of Y given X is zero, p*{X,Y) need not be equal to zero. 
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From Lemma 2 and Lemm.a 3, a sufficient condition for p*(X,Y) to exist is 
tfiat EX and EY exist. An example sliowing tliat this condition is not necessary 
is Example 2. Note that the existence of the ordinary correlation p has the much 
stronger requirement of finite marginal variances. Summarizing, we have 

p{X,Y) exists ^ {cr^(^) and a'^iY) exist } ^ {EX and EY exist } ^ 

=> {E{X2a - ^3:4)^ and E{Y2:4 - Ys-a)^ exist } ^ p*{X,Y) exists 

where the one-way implications arc strict. 
We now proceed to the proof of Theorem 2: 



Proof of Theorem 2. Part 1 follows directly from the definition. 
Part 2: From Lemma 1 we obtain 

hF,{a,b) = / [■y{a,w) - Fi{w)][-f{b,w) - Fi{w)]dw 

Furthermore, note that 

Fuix, y) - F^{x)F2{y) = E[^{X, x) - F^{x)][^{Y, y) - F2{y)] 

which is easy to verify. Using these results and the finitcness of each side of (2.9) 
which allows us to apply Fubini's theorem, we obtain 

k{X,Y) = EhFAXuX2)hF,{Yi,Y2) 

^ E f[j{Xi,x) - Fi{x)MX2,x)-Fiix)]dx X 

[7(^1,2;) - F2{y)MY2,y) - F2(y)]dy 

^ J E[j{Xi,x)- Fi (x)] [7(ri ,y)-F2 (y)] x 
E[jiX2,x) - F,{x)MY2,y) - F2{y)]dxdy 
(3.22) = [[Fi2{x,y)^Fi{x)F2{y)Ydxdy 



li X ^LY , the integrand is zero so k(X, F) = 0. It remains to be shown that XA/Y 
implies k{X, Y) ^{). We next sketch the proof. 

If XU/r then there is an (a, b) such that D{a, b) = Fi2(a, b) - Fi (0)^2(6) 7^ 0. 
We now show that if D{a, b) ^ 0, then there is an open interval, which has (a, b) as a 
limit point, such that D ^ on that interval. It then follows that (3.22) is nonzero. 
Let £12 > be the probability mass in (a, 6), ei > be the marginal probability 
mass in a and £2 > be the marginal probability mass in b. Denote by D{a^ , b^) 
the limit approaching from anywhere in one of four open 'quadrants' defined by 
(a, b). Then from the definition of F12, Fi and F2 given in the introduction, 

Dia-,b-) = D{a,b) - {S12 + ^£1^2(6) + §£2^1(0) - i£i£2 

Dia-,b+) ^ D{a,b) + i£i2 + ^£1^3 (6) - i£2Fi(a) + i£i£2 
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D{a+,b-) = D{a, b) + ieia - h^iF^ib) + ^£2^1 (a) + ^£162 

D{a+,b+) = D{a,b) + |£i2 - ^£1^3(6) - i£2Fi(a) - i£i£2 

Now if D{a^ b) y^ these four expressions cannot all be zero, so there must be an 
open set, in at least one of the four 'quadrants' and with (a, b) as a limit point, 
where D is nonzero. Hence, (3.22) cannot be zero. 

Part 3: By definition )i{X,Y)^ < k{X,X)k{Y,Y) is equivalent to 

[EhF,{Xi, X2)hF,{Yi,Y2)f < EhFAXuX2fEhF,{Yi,Y2f 

This is a Cauchy-Schwartz inequality so it holds, and equality holds iff 

(3.23) hFMi.X2)=-chF,{yi.Y2) 

for some constant c. If Y =' aX + b for certain constants a and b then it is imme- 
diately verified that (3.23) holds with c = \a\. 

The reverse implication that (3.23) implies linearity remains to be shown. Sup- 
pose that (3.23) holds. Then 

hFi{Xi,X2) — hFi{Xi,X^) — hFi{X2,X4) — hFi{X^,X4) — 

c[hF,{Y,,Y2) - HfAYi^Ys) - hF,(Y2,Y4) - hF^Xs^X^)] 
which reduces to 

\Xi - X2I - |Xi - X3I - 1X2 - Xi\ + 1X3 - X4I "=i- 
c(|yi - Y2\ - |Yi - Ysl - \Y2 - Yi\ + \Y3 - Yi\) 
But this is equivalent to 

(3.24) X3..4 - X2-A =■ C (115:4 - Y2-a) 

Now without loss of generality suppose Y^a — cX^a + b and Y2:4 — cX2-a + b' for 
some b and b' . Substitution into (3.24) then yields b ~ b' , so the second and third 
order statistics for X and Y are linearly related. Now the distribution function of 
the second order statistic for X is 



Fl;2:4(a;) - / Fi{t)[l - Fi{t)YdFi{t) 

and for Y 

F2My) - / F2{t)[l - F2{t)fdF2{t) 
J — 00 

It is now straightforward to show that the equation -^2:2:4(2/) — Fi-2-a{cx + b) leads 
to F2{y) — Fi{cx + b), so X and Y are linearly related. 

Part 4: Without loss of generality assume X G {0, 1} and Y G {0, 1} with 
probability one. Then Hf^ = uf^ and /i^a — UF2I (both u and h defined in Section 1), 
so k{X,Y) = EhF,{XuX2)hF,{Yi,Y2) ^ EufAXi,X2)ufJYi,Y2) ^ coy{X,Y). 

Part 5: Since k{Z, Z) = EhF{Zi, Z-if' ^ this follows directly from Lemma 2 D 
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We conclude this section by giving some representations of k in terms of hp and 
the (conditional) distribution functions of X and Y . Let 

F2\i{y\x) = P{Y < y\X = x) + iP(r = y\X = x) 

be the conditional distribution function of Y given X — x. 

Lemma 10. The following equalities hold for k: 

1. n{XX) = J[Fi2{x,y) - Fiix)F2iy)]^dxdy 

2. KiX,Y) = EhF,{Xi,X2)J[F2ii{y\X,) - F2(y)][F2|i(y|X2) - F2{y)]dy 

Proof. Part 1: This follows from the proof of Theorem 2, Part 2 
Part 2: First note that dxyFi2{x,y) = dxFi{x)dyF2\i{y\x) which we write in 
shorthand dFi2{x,y) = dFi{x)dF2\i{y\x). Hence, 

KiX,Y) = EhFAXl,X2)hF,iYi,Y2) 

hFAxi,X2) / [7(2/1, y) - F2{y)]["f{y2,y) - F2{y)]dydFi2{xi,yi)dFi2{x2,y2) 



hFi{xi,X2) / [7(2/1,?/) - F2{y)]dF2ii{yi\xi)["f{y2,y) - F2{y)]dF2\i{y2\x2)dydFi{xi)dFi{x2) 

hF,{xi,x2) j[F2\i{y\xx) ~ F2{y)][F2\i{y\x2) - F2{y)]dydFx{xi)dFi{x2) 

= EhFAXi,X2)J[F2iAy\Xi) - F2{y)][F2ii{y\X2) - F2{y)]dy 

D 

Note the similarity of Part 1 of Lemma 10 with the formula for the covariance 
given by Hoeffding (1940): 

cov{X,Y) = j[Fi2{x,y) - Fi{x)F2{y)]dxdy 

3.2. Orthogonal decomposition Let us assume Hfi and hF2 are square integrable 
and have the spectral decompositions 

00 

(3.25) /i_Fi(xi,a;2) ^'^\k9ik{xi)gik{x2) 

k=0 

00 

(3.26) hF2{yi,y2) = ^Mfc52fc(2/i).92fc(2/2) 

fc=0 

See Lemma 2 on how to check for square integrability. For ease of notation, we write 
the correlations between marginal eigenfunctions as 

Pki{X,Y)=p[gik{X),g2i{Y)] 

We now have the orthogonal decomposition given as follows: 
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Theorem 3. Suppose hpi and /1F2 are square integrable with spectral decom- 
positions as above. Then with convergence in mean square, 



K{X,Y)=Y,Y.^kt^iPki{X,Yf 



k=l 1=1 



and 



p*{X,Y) 



-. 00 00 



Proof. Write 

K^^-\X,Y)^E 



N 



N 



^ Afe5ife (Xi)gik (X2) ^ pig2i {Yi )g2i {Y2, 
fc=i 1=1 

N 

hp,{Xi,X2)J2f'i92i{Yi)g2i{Y2) 



1=1 



N 



J2\kgikiXi)gikiX2) hF,{Yi,Y2) 



.fe=i 



Then straightforward algebra gives 

N N 

«(~'^)(X, r) ^ ^ ^ Afe Mi p[gik{X),g2i{Y)f 

k=l 1=1 

By the Cauchy-Schwartz inequality we obtain 

U{X, Y) - «("'■) (X, Y) - k(-^^) (X, Y) + k(^^^) (X, y )) ' 



E 



< E 



N 



hFi{Xi,X2) - y^ Afcffifc {Xi)gik {X2) 



fe=i 

AT 



TV 



hFAYi,Y2) -J2l^kg2i{Yi)g2i{Y2, 



hFiiXi,X2) - X! Afcgifc {Xi)gik (X2) 



*;=! 



E 



1=1 

N 



hFAYi,Y2) -J2tikgAYi)g2iiY2) 



By mean square convergence of the spectral decomposition the latter goes to zero 

as iV ^ 00 so 

k{X, Y) - k(^")(X, Y) - k(-'^)(X, Y) + k(^'^)(X, r) ^ 
as iV -^ 00. Similarly we find 

k{x,y)-k'--'^Hx,y)^o 

as TV ^ 00. It follows that 

as iV -^ 00, which is the desired result. D 
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The simplest example of a decomposition is if both variables are dichotomous, 
say P{X = 0) = 1 - P{X ^ I) = p and P(Y = 0) = 1 - P{Y = 1) = g, we 
obtain Ai = 2p{l — p), /ii = 2q{l — q), and Afe = /i^ = for fc > 1 so that 
p*{X, Y) = p{X, F)^, see Theorem 2, Part 4. In this special case the decomposition 
consists of just one component. 

3.3. Parameterization of the likelihood Let /12 be the joint density of {X,Y) 
with corresponding marginal densities /i and /2. Since 

Pki^ / f f \f I \ 9ikix)g2iiy)dFi{x)dF2{y) 
J h{x)h{v) 

we can decompose the joint density as: 



CO 00 



(3.27) /i2(x,y) = /i(x)/2(2/) 1 + J] ^gi/c(x)52/(y)Pfci 

A similar equation can be given for discrete distributions, and a general treatment 
can be given using the Radon-Nikodym derivative. 

Decomposition (3.27) may be compared to the well-known canonical correlation 
decomposition 



fi2{x,y) ^ fi{x)f2{y) l + ^aik{x)a2k{y)p, 



•k 
\ k=l / 

Here, aik and a2k are those functions maximizing the correlation between X and Y , 
subject to the restraint (for fc > 1) that they are orthogonal to an, . . . , ai^k-i and 
021, . . . , a2,fc-i, respectively, and pk is the correlation between aik{X) and b2k{Y). 

3.4. Frechet bounds for component correlations Below, we discuss the inter- 
pretation and properties of the component correlations. In particular, we look at 
bounds for the component correlations. 

For two random variables X and Y with joint distribution function F12 and 
marginal distribution functions Fi and F2, the well-known Frechet upper bound 
F^2 is defined by 

F+{x,y)^mm{F,{x),F2{y)} 

and the Frechet lower bound F^2 is defined by 

Ff2(x,y) = max{0, 1 - F,{x) - F2{y)} 

Then piX,Y) ^ 1 if and only if F12 = F+ and p{X,Y) = -1 if and only if 
F12 = F^2- A more general question is, for functions g and h, for which F12 the 
correlation between g{X) and h{Y) is maximal or minimal. Let 

Sl^^{{x,y)eJ\.^\g{x)^Hy)} 

and 

S-^^^ {{x,y) eB?\g{x) ^ ^h{y)} 
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Then we have: 

Lemma 11. For functions g and h, p[g{X),h(Y)\ ~ 1 iff the support of the 
distribution of{X,Y) is a subset of S^f^ and and p[g{X)^h{Y)] = —1 iff the support 
is a subset of S^f^. 

Proof. Suppose for simphcity that g and h are standardized. Then 
p[g{X),hiY)] = 1 iff P[g{X) =' h{Y)] '= 1 and p[giX),h{Y)] = -1 iff 
P[g{X) = —h(Y)] = 1, and the lemma immediately follows. □ 

For the dichotomous case we obtain the following: 

Example 3. // both variables are dichotomous, say P{X = 0) = 1 — P{X = 
1) = p and P{Y =. 0) = 1 - P{Y ^ 1) ^ q, we obtain S^^ = {(0,0), (1, 1)} and 
5ri = {(0,l),(l,0)}. 

Note that the bounds need not be attainable since it may be the case that, for 
example, for certain x, there is no y such that {x,y) G Sl'i^. 

Here we are interested in the bounds for the component correlations pij of p* . 
For simplicity, we write S~j = S^. . . The next example shows that if both X and 
Y have uniform distributions on [0, 1], then the component correlations of p* can 
attain the bounds 1 and —1: 

Example 4. Suppose X and Y are uniformly distributed on [0,1]. Then the 
eigenf unctions are the Fourier cosine functions (see Table 1). The set S^i is formed 
by the solutions {x, y) to the equation 

cos(A:7ra;) — cos{liry) 

and Sj^i by the solutions of 

cos{kTTx) ~ — cos{lTry) 

The solutions are plotted in Figure 2, for k ~ 1, . . . , 4 and I ~ 1, . . . , 4. The bounds 
for the pki are attainable since pki {X, Y) ~ 1 for the uniform distribution on S'^i 
and pki {X, Y) = —I for the uniform distribution on Sj^i . 

Note that, if pu = I, then p22 = 1, since 5^ C 5^. More generally, by the same 
reasoning, we have for all i > I, 

Pkl{X,Y) = 1 => p^^kM{X,Y) = 1 



Pkl{X,Y) = -1 ^ p^y,kaxl{X,Y) = i-iy 

By a symmetry argument, we also have pn = 1 => pi2 = 0, and there are various 
other similar implications. 

An overview of a large amount of literature on Frechet bounds is given in 
Riischendorf (1991) 
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(a) Positive bounds S^-: pij = 1 if support of 
(X, Y) is subset of S± 




(b) Negative bounds S^-: pij = 1 if support of 
{X, Y) is subset of S~. 



Fig. 2. Supports of the Frechet bounds for the component correlations pij of p*{X,Y) when X 
and Y are uniformly distributed on an interval. 
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4. Estimation and tests of independence In this section we discuss esti- 
mation of K and p* by U- and V-statistics. Roughly speaking, the U-statistic esti- 
mator of a parameter is an unbiased estimator based on taking averages (Hoeffding, 
1948a), and the V-statistic estimator is the estimator based on the distribution ob- 
tained by assigning a probabihty weight 1/n to each sample point. For k, both the 
U- and V-statistic estimators are available, but for p* only the latter is. However, 
we can estimate p* by a function of U-statistic estimators. 

In Section 4.1, it is shown how estimators of k. and p* by U- and V-statistics are 
obtained. In Section 4.2 permutation tests, useful for small samples, are described. 
In Section 4.3, the asymptotic distribution of these estimators is derived under the 
null hypothesis of independence. In Section 4.4, Bonferroni corrections for tests of 
significance of the component correlations are described. 

4.1. U and V statistic estimators of k We first give a method for calculating 
the U- and V-statistic estimators of k based on a sample (Xi, yi), . . . , (X„, y„), 
then we give the related estimates for p* . 

The V-statistic estimator k is the value of k based on the sample distribution 
functions Fi and F2, and is obtained as follows. Let 



1 " 
Aik — — } ^ \Xk — Xi 

i—l 

1 " 
n ^ — ^ 



n 

1=1 



and 



i=i ]=i 

^ n n 



B2 

n 

i=i 0=1 



Then we have for k,l — 1, . . . ,n, 

hp^{xi,X2) = -i {\xi - X2\ - Aik - All + Bi) 
hp^{yi,y2) = -^{\yi -2/2! - A2k - A21 + B2) 

and the sample or V-statistic estimator of k is given as 

1 " 

Now with 

hp{xi,X2) = -^ ( \xi - X2\ '^Aik '^-—Aii H '^Bi 

1 V n— 1 n— 1 n— 1 



n^ 
«,j=i 



n , n , n 
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hp^(yi,y2) = -h m -y-A -:Mk -Mi H -B2 

n — V n — V n — 1 



for fc, ? = 1, . . . , n, the unbiased or U-statistic estimator of k is given as 

^ n— 1 n 

*- ' i=l j=i+l 

By Hoeffding's theory of U-statistics we have that k is an unbiased estimator of k 
(Hoeffding, 1948a; Randies & Wolfe, 1979). Note that k > but k may be negative. 
The related estimators of p* are the following: 

kix,Y) 



p*iX,Y) 
p*{X,Y) = 



^kiX,X)k{Y,Y) 
kiX,Y) 



^k{X,X)k{Y,Y) 

For both types of estimators, the computational complexity of the above method 
is 0{n^). 

The marginal eigenvalues and functions can be computed numerically from h^ 
and hp or from hp and hp . See also Section 2.5 for computational aspects. 

4.2. Permutation tests Under independence, the sample marginal distributions 
of X and Y are ancillary statistics for p* and p* , so by Fisher's theory of fiducial 
inference we should condition on the sample marginals when testing for indepen- 
dence using p* and p* .li X JLY, conditioning on the marginals ensures that p* and 
p* are distribution free, and exact conditional p-values can be calculated using the 
permutation method. Evaluating all permutations quickly becomes computation- 
ally prohibitive even for moderately large sample sizes, and we recommend using a 
set of random permutations. Note that permutation tests may also be applied to 
the component correlations pij and pij . 

Permutation tests may be computationally intensive. Using non-optimized soft- 
ware, our experience shows that (bootstrap) permutation tests for up to a several 
hundred observations are feasible: for n ~ 100, the permutation test based on 1000 
random permutations took less than four minutes, and for n — 500 it took a bit 
more than one hour. Techniques for the fast exact evaluation of permutation tests 
using generating functions are described by, among others, Baglivo, Pagano, and 
Spino (1996) and Van de Wiel, Di Bucchianico, and Van der Laan (1999), but it 
is not clear whether these techniques extend to statistics such as p* which are not 
based on ranks. 

For categorical data the permutation test is better known as an exact conditional 
test (where the conditioning is, again, on the marginal distributions), the Fisher 
exact test being the best known example. There is a large body of literature on 
fast evaluation of exact conditional p- values for contingency tables, for an overview 
see Agresti (1992) and for more recent developments see Forster, McDonald, and 
Smith (1996), Diaconis and Sturmfels (1998), Booth and Butler (1999). 



A new correlation coefBcient 



29 



If the permutation test is too computationally intensive, an asymptotic test may 
be done using the results of the next section. 

4.3. Asymptotic distribution of estimators under independence For the asymp- 
totic distribution of the estimators we obtain the following: 

Theorem 4. Suppose hp-^ and hF2 or^ square integrable with spectral decompo- 
sitions (3.25) and (3.26). Then if X JL Y and with Zij iid standard normal variables, 
we obtain 

oo 

nk{X,Y)^D Y,\^^l,{Zl-\) 
If additionally hp-^ and hp^ are trace class, we obtain 



z{X,Y)^D Y. ^«^^^' 



Proof. By the Hoeffding (1961) decomposition we can write with R„ 
Oin-"^) 

-1 



^ /iFi {xi , Xj )hF^ {yi ,yj) + R„ 



Y^ {Y>^k9lk{x^)glk{xJ)\ iYfj'i92i{yi)92i{yj)j -\- Rn 

l<*<i<« \k^l / \l^l J 

/ \ 2 



-1 oo 



Ygikixt)g2iiyt) -Y9ik{xt)^g2i{yi 



Rn 



Since n ^ Ya=i 9ik{xiY g2i{Vi)'^ —> 1 a.s., and n ^ {YJl^igik{xi)g2i{yi)) -^d Zli 
we obtain using the Cramer- Wold device that 

oo 

The proof for k is similar; we have 

1 " 
k= —^hp^ {xi,Xj)hp^ {yi,yj) + R„ 

_.n/oo \/oo \ 

= — X! 'Y^k9ikixi)gik{xj) j ^M;52/(2/j)52z(yj) j 

_. oo / n \^ 

= — X! ^'^'^' {Yglk{x^)g2liy^) j + Rn 
^ k,l=l \i=l J 



Rn 
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Since n~^ {^^=i 9ikixi)g2i{yi)) -^d Z^.^ we obtain using the Cramer- Wold device 
that 



nn 



i,j=0 



under the condition that 



00 00 



hm E{nk) = > Ai /iw = 7 A^ 7 /i, 

n^oc ■'- — ^ -*- — ^ ■'- — ^ 

'ij=0 1=1 j=0 

is finite. Now by Lemma 5, the two factors on the right hand side are finite iff hp^ 
and /ii?2 are trace class, completing the proof. □ 

The proof is similar to an adaptation by De Wet (1987) of a proof by Eagleson 
(1979). See also Gregory (1977) and Hall (1979). 

Note that by Lemma 3, hp^ and /ii?2 are trace class iff EX and EY exist. As 
follows from the theorem and noted earlier by Dc Wet (1987) for related statistics, 
the U-statistic estimator has an asymptotic distribution in more cases than the 
V-statistic estimator. For example, if the marginal distribution of at least one of X 
and Y is the distribution of Example 2, nk does not have an asymptotic distribution 
but nk does have one. 

4.4. Bonferroni corrections for testing significance of component correlations As 
well as testing the significance of p* directly, we can test for the significance of the 
empirical component correlations pij. We recommend using p* rather than p* for 
calculating component correlations, since p* may be negative in which case no 
component correlations with nonnegative eigenvalues exist. 

The proof of Theorem 4 suggests that the component correlations pki are asymp- 
totically normal and independent. Since there are many component correlations, a 
simultaneous test of their significance needs a Bonferroni correction. The ordinary 
Bonferroni correction, i.e., multiplying the exceedance probabilities by the number 
of tests done, which in this case is n^, would be unreasonable since the multipli- 
cation factor increases rapidly with n. Instead we propose dividing the exceedance 
probability for pij by 



A, 



iA*j 



En ? ^—~\n 
1=1 ^i l^j=i Mi 

Note that these numbers will converge to zero in probability if at least one of hp-^ 
and hp^ is not of trace class, i.e., by Lemma 3, if at least one of EX or EY does 
not exist, in which case the correction may not be the most appropriate one. 

The idea of looking at components of a test seems to have first appeared in 
Durbin and Knott (1972), where components of the Cramer-von Mises test were 
investigated. This test is a special case of the tests based on p* described above (see 
Section 5.2). 
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Other related work is by Kallcnberg and Ledwina (1999), who looked at corre- 
lations between orthogonal functions of the marginal cumulative distribution func- 
tions, in particular, the Legendre polynomials. This work is an extension of the 
so-called smooth tests of fit of Neyman (1937). Rather than looking at all corre- 
lations between the orthogonal functions, they considered just the first few, and 
developed a selection method based on Schwartz's rule for determining how many 
correlations to base the overall test on. 

5. Grade versions of k and p*, copulas, and rank tests For ordinal ran- 
dom variables X and Y , any given scale is arbitrary and it may be desirable to use 
scales based on the grades Fi{X) and F2{Y) of X and Y. A general way to base 
K and p* on grades is as follows. For given invertiblc distribution functions Ki and 
K2, we can define 

kk,^kAX,Y) = k[K^' o Fi{X),K^' o F^iY)] 

and 

te„K.(^,n = P*[K^' o F^{X),K^' o F2{Y)] 

Note that 

KF,MX,Y) ^ k{X,Y) 

and 

P*F,,F,{X,Y)^p*{X,Y) 

With Ki and K2 uniform distribution functions, p*j^ ^ (X, Y) is to p* what Spear- 
man's rho is to the ordinary correlation p. 

We can use the results of Section 4 to obtain an orthogonal decomposition of 
kki,K2 ^-nd p*j^ j^ in terms of component correlations. These component corre- 
lations then parameterize the copula, which is defined as the joint distribution of 
(i^i(X), F2(F)). From (3.27), and since the marginal eigenfunctions of hp with F 
the uniform distribution are the cosine functions given in Table 1, we obtain the 
following decomposition of C12 , the density function of the copula: 

00 00 

Cl2(w, w) = 1 + ^ y COs{kTTu) COs{l'!Tv)pkl 

where pki = J cos{kTTu) cos{lTrv)ci2{u, v)dudv. This decomposition was earlier given 
in De Wet (1980) and Deheuvels (1981). An overview of copula theory is given in 
Nelsen (2006). Possible drawbacks of using pj^ ^ for some given Ki and K2 is 
the arbitrariness of any choice of Ki and K2 and the loss of scale information, but 
these issues are hotly debated (Mikosch, 2006). 

In Section 5.1, a brief description of rank tests based on k is given. In Section 5.2 
a generalization of the Cramer-von Mises test to the case of K ordered samples is 
shown to be a special case, and a convenient representation is given. In Section 5.3 
we write kki,K2 &s a weighted average of ^-coefficients. 
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5.1. Rank tests Rank statistics which are distribution free under independence 
in the continuous case are obtained as follows. For invertible distribution functions 
Ki and K2 let 

~Pk,mM^Y) = ~P*\K^' o F,{X),K^' o F2{Y)] 

The derivation of the asymptotic distribution of these statistics is slightly more 
involved than that of the asymptotic distribution of p*{X,Y). De Wet (1980) has 
done this derivation for statistics related to p*j^ ^ {X,Y). He gave the weights for 
optimal tests in the Bahadur sense for certain classes of alternatives, such as the 
bivariate normal. 

With Ki and K2 the uniform distribution functions, np*^^ ^ {X, Y) is a statis- 
tic discussed by Blum, Kicfer, and Rosenblatt (1961), see also Deheuvels (1981). 
It can be viewed as a generalization of the ordinary Cramer-von Mises test (see 
next subsection). Similarly, with Ki and K2 the logistic distribution functions, 
np*j^ j^ {X, Y) can be viewed as a generalization of the Anderson-Darling test. 

Hocffding (1948b) described a related test, namely based on the U-statistic esti- 
mator of 

[F,2 (x, y) - F, {x)F2 {v)f dF,2 (x, y) 

which can be obtained from the representation of k in Lemma 10, Part 1, by re- 
placing dxdy by dFi2{x, y). Hoeffding's coefficient does not fall in the framework of 
the present paper. 

5.2. A new class of K -sample Cramer-von Mises tests as a special case Suppose 
we have K samples, the fcth sample having Uk iid observations, say {Uki, . . . , Ukuk}- 
Then a test whether the distributions of the observations in the different samples 
are equal is called a K-sample test. A X-samplc test can, in fact, be viewed as a 
test of independence, namely, whether 'response' depends on 'group membership,' 
the groups referring to the different samples. Let us consider the case that the score 
Cfc G R is assigned to sample k {k = 1, . . . , K) . With A^o = and A^^ = X]i=i '^i 
let {Xn^^_^+i^,Yn^^_^+i^) = {ci,UkSk) for fc = 1,. .., a: and ik == 1,. ..,nfe. Then it 
can be seen that the K sample test is a test of independence of the X observations 
and the Y observations. (Note that here the X observations are not random). A 
AT-sample test can then be based on p* or p* . 

If samples are ordered but have no numerical scores assigned to them, rank scores 
can be assigned, for example Ck = Nk- 

In order to arrive at the Cramer-von Mises test, we now use Lemma 10, Part 2 to 
give a representation of k in terms of the conditional distribution functions. Let Gk 
be the distribution function of Uk, the response for sample k, and let pk = rik/Nx 
be the proportion of observations in sample k. Then we obtain 

k{X,Y) = ^Kp,/if (c„c,) / [GM - F2{y)] [G,{y) - ^2(2/)] dy 

i,3 •' 
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Some straightforward algebra shows that for the two-sample case this reduces to 

k{X, Y) = pIpI J [Gi (y) - G2 {y)f dy 
A grade version of n is 
kf,,k{X,Y) = ^p,p,;if (c„c,) / [G,{y) - F2(y)] [G,(y) - F2(y)] w[F2{y)]dF2iy) 

where 

"'^"^ = k[K^] 

In the two-sample case, the sample version of kp-^k{X, Y) with K the uniform dis- 
tribution function reduces (essentially) to the ordinary Cramer-von Mises statistic, 
so we have a generalization to the case of K ordered samples. With K the logistic 
distribution, kf^j^ (X, F) reduces to the Anderson-Darling statistic. 

A different generalization of the two-sample Cramer-von Mises test was given by 
Kiefer (1959), namely to the case of K unordered samples. 

5.3. K as a weighted (j)- coefficient From Lemma 10, Part 1, we directly obtain 
(5.28^Ki,K,(X, Y) = f[F,2{x,y) - Fi{x)F2{y)f dK^' o Fi{x) dK^' o F2{y) 



This result leads to an interesting interpretation of hki.K2- The (p coefficient for 
measuring the dependence in the 2x2 table obtained by collapsing the distribution 
with respect to the cut point [x, y) is given as 

(5.29) 0(., ,) = \F^^i-^y)-F^i-)My)\ 

Now suppose -0 is such that 

4>{x,y)^^[F^{x),F2{y)] 

Then from (5.28) we obtain that hki,K2 can be written as a weighted average 
f/'-square: 

KKi,K2{X,Y) = j ip{u,vf WKi{u)wK2{v)dudv 

where the weight function w is defined by 

"^(1 — u) 



wk{u) 



k[K-^{u)] 

The normalized weight function (integrating to one) is 

- / N wk{u) 

Jo WKKUjdu 
where 

1 /•! /"CO 

WK{u)du^ u{l - u)dK-\u) ^ K{x)[l-K{x)]dx 

Jo J-00 
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Fig. 3. Plots of wn{u) for several distributions K 

In Figure 3, Wk is plotted for the distribution functions K given in Table 1. 

From Figure 3 we can deduce which (/)-coefficients are given most weight for 
different marginal distributions. As the reference marginal, we take the logistic, 
which is the horizontal line in the figure, i.e., assigning uniform weights. We see 
that for a uniform marginal, the weight goes to zero in the tails. The weights for a 
normal marginal are for most u intermediate between the weights for the uniform 
and logistic marginals. In contrast to uniform and normal marginals, the Laplace 
distribution gives more weight to the tails than a logistic marginal. An exponential 
marginal gives little weight to the lower tail, but much weight to the upper tail. 
Finally, the chi-square distribution gives very large weight to the lower tail and 
very small weight to the upper tail. Among the distributions considered, the biggest 
difference is between the chi-square and the exponential distribution. 



6. Data analysis: investigating the nature of the association Gaining 
an understanding of the nature of an association between two random variables is 
probably best viewed as an art rather than a science, and in this section we present 
some visual tools based on p* and its components which may be helpful in reaching 
this objective. For an iid bivariate sample {{Xi,Yi)} we propose the following two 
procedures. 

Firstly, we calculate p* from the sample and test its significance. If found to be 
significant, then for each data point {Xi, Yi) we calculate the weight 



W,,^ 



Since 



^k{X,X)k{Y,Y) 

1 " 

p*iX,Y) = -J2W^ 
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the weights Wt give an indication of how much the sample element {Xi,Yi) con- 
tributes to p* , and so can be used to discover the nature of a possible association 
between X and Y. 

Secondly, we calculate the component correlations pki of p* and test their sig- 
nificance using the Bonfcrroni corrections described in Section 4.4. Then for each 
significant component correlation pki, we compute the weights 

W^^''^''> = gikiX,)g2i{Y,) 

where gik and g2i are the eigenfunctions belonging to hpi and hp.-^. Since 



n 



Pki- - ^ ■"^^''''^ 
n 

i=l 



the weight W^ ' is the amount the sample element {Xi,Yi) contributes to pki 
(conditionally on the marginals), and so, like Wi, can be used to investigate the 
association between X and Y. 

Ik I) 

In this section we show how to visualize the weights Wi and W^ ' , both for 
continuous and categorical data, and show how this can be used to gain an under- 
standing of the association. Some artificial continuous data sets are considered in 
Section 6.1, a real categorical data set is considered in Section 6.2 and a real time 
series data set is considered in Section 6.3 

6.1. Some artificial data sets In Figure 4, four artificial data sets are plotted, 
each consisting of 100 iid points. For completeness, we explain how the data were 
generated. In the following, U is uniformly distributed on [0, 1] and Zi and Z2 
are iid standard normal random variables, and Z{u) has a normal distribution 
with mean zero and standard deviation u. The data in Figure 4(a) are from a 
bivariate normal distribution with p = 2/3. The data in Figure 4(b) are of the 
form (X, Y) = (U, {U - 1/2)2) _^ (^Zi/10, Z2/IO). The data in Figure 4(c) are of the 
form {X,Y) = (U,Z{l/5 + U)). Finally, the data in Figure 4(d) arc of the form 
{X, Y) = ([/, Z{l/5 + min{U, 1 - U})). ' 

For all four data sets, we performed permutation tests for the significance of p* 
and its component correlations pij based on 10,000 random permutations. This took 
us about 48 minutes for each data set. For the significant component correlations 
we took 1 million random permutations to get a more accurate p-value, and this 
took about 110 seconds per component correlation. We also computed the ordinary 
correlation, and, not surprisingly, only for the data in Figure 4(a) it is significantly 
different from zero. There we found that p = .44 {p = .000). 

For the data in Figure 4(a), we found that p* ~ .36 {p ~ .000), i.e., there is 
significant association. In Figure 5(a), the data arc plotted again, this time each 
sample element {Xi,Yi) is represented with size proportional to \Wi\; black dots 
represent positive Wi and white represent negative Wi. In all these plots, the total 
area of the dots is scaled to a constant to make it easier to study them. From 
Figure 5 (a), we see that the association seems to consist of a linearity in the data. 
It may be worthwhile however to check if there are other forms of association 
present by looking at the component correlations of p* . We found two significant 
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(a) Increasing location (p* = .36) 



(b) Decreasing then increasing location 
(p* = .17) 



(c) Increasing dispersion (p* = .11) 



(d) Increasing then decreasing dispersion 
(p* = .02) 



Fig. 4. Scatterplots of artificial data sets. The captions denote what happens to the conditional 
Y distribution as X increases. 
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(a) Increasing location (p* = .36) 



(b) Decreasing then increasing location 
(/3* = .17) 





(c) Increasing dispersion (p* = .11) 



(d) Increasing then decreasing dispersion 
(P* = -02) 



Fig. 5. Representation of weights Wi for data in Figure 4- The size of each dot is proportional 
to \Wi\; black dots represent positive Wi and white represent negative Wi. Total area of dots is 
scaled to a fixed constant for each plot. 
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Mental Health Status 


Parents' 




Mild 


Moderate 




Socioeconomic 




Symptom 


Symptom 




Status 


WcU 


Formation 


Formation 


Impaired 


A (high) 


64 


94 


58 


46 


B 


57 


94 


64 


40 


C 


57 


105 


65 


60 


D 


72 


141 


77 


94 


F 


36 


97 


54 


78 


G (low) 


21 


71 


54 


71 



Table 3 
Cross- classification of Mental Health Status and Socioeconomic Status 



components: pn = .61 (p = .000) and p22 — -38 {p — .004). In Figures 6(a) and 6(b) 

fill f22') 

the weights W^; and W^ are visualized. The interpetation of the black and white 
dots is the same as above. The gridlines correspond to the zeroes of the marginal 
eigenfunctions. Therefore, within any rectangle the dots have the same color. Also 
in this case, the plots point to a linearity in the data. 

For the data in Figure 4(b), we found p* = .17 (p = .000) and we found two 
significant component correlations: p2i = —.78 {p = .000) and p42 — -54 (p = .000). 
The plots in Figures 5(b), 6(c) and 6(d) all point to a curved relationship. 

For the data in Figure 4(c), we found that p* = .11 {p = .001). There is significant 
association at the 5% level, but the evidence is not as overwhelming as in the 
previous two cases. We only found one significant component correlation: pi2 — —.51 
(p = .000). Figures 5(c) and 6(e) both point towards and increase in dispersion of 
the Y variable as X increases. 

For the data in Figure 4(d), wc found that p* — .02 (p = .522). This time, 
the test based on p* does not yield a significant association. However, there is one 
significant component correlation: p22 = —.38 (p = .004). Figure 6(f) indicates that 
the association is due to an increase and then decrease in dispersion. Since p* is not 
significant, we should refrain from giving an interpretation based on Figure 5(d). 

For comparative purposes, we plotted the weights Wi for data drawn from a 
distribution satisfying independence in Figure 7. Both sets consist of 100 sample 
elements. The most common pattern is that of Figure 7(a), with two diagonally 
opposing clusters of black dots and two diagonally opposing clusters of white dots. 
In a very limited investigation, this type of pattern occurcd about half of the time. 
Otherwise more complex patterns were obtained, such as the one in Figure 7(b). 
These figures indicate that it doesn't seem to make sense to interpret this kind of 
plot if p* is not significant, such as Figure 5(d). 

Concluding, we see that significance tests combined with an inspection of the 
two types of plots in Figures 5 and 6 can give us insight into the kind of association 
present between two random variables. 
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(a) pii = .61 (data from Figure 4(a)) 



(b) p22 = -38 (data from Figure 4(a)) 
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(c) P21 = —.78 (data from Figure 4(b)) 



(d) p42 = .54 (data from Figure 4(b)) 
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(e) pi2 = —.51 (data from Figure 4(c)) 



(f) P22 = -.38 (data from Figure 4(d)) 



(k I) 
Fig. 6. Representation of weights W ' contributing to pj.; for data in Figure 4- The meaning 

of the dots is otherwise the same as in Figure 5. 
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(a) Most common pattern 



(b) Loss common pattern 



Fig. 7. Representation of weights Wi for data drawn from distributions satisfying independence. 






(a) /3* = .02 



(b) pii = .13 



(c) P13 = .08 



Fig. 8. Representation of weights Wab, M^^b ""'^ ^ab ' f'^spectively , for mental health data 
of Table 3. The darker the shade of gray in cell (a, b) the larger Wabl ^ab = is represented by 
an intermediate shade of gray. 
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6.2. Mental health data Tabic 3 describes the relationship between child's men- 
tal impairment and parents' socioeconomic status for a sample of residents of Man- 
hattan (Goodman, 1985; Agresti, 2002, and references therein). Goodman used this 
table to illustrate various association models for categorical data, including the so- 
called linear by linear association model, the row and columns effects model, and 
correspondence analysis based on canonical correlations. Here we illustrate the use 
of p* and its components as yet an alternative method for analyzing these data. We 
relied on asymptotic p-values because with 1670 observations approximate eval- 
uation of the permutation tests would have been too time consuming using our 
implementation. 

In the categorical case, for an / x J contingency table, it suffices to calculate 
weights for the cells, i.e., it is not necessary to calculate separately a weight for 
each individual observation. For an observation {Xi, Yi) in cell (a, 5), the weight Wi 
reduces to 

ELi E/=i P^3hp^ [i, a)hp^ (j, b) 



Wab =Pab- 



^k{X,X)k{Y,Y) 



where Pat is the proportion of observations in cell (a, 5). Similarly, the weights 
belonging to component correlation pki are 

^ab ^ Pabgik{a)92l{b) 

ior a — I, . . . ,1, b ^ I, . . . ,J, k — 1, . . . ,1 and I ~ 1, . . . ,J. Note that 

/ .7 



a=l 6=1 



and 



p--EE^r 

a=l 6=1 

We found that p* = .02 {p = .000), i.e., there is significant association in the data. 
The weights Wab for the cells are represented in Figure 8(a). Here, the grayscale 
represents the size of Wab- the darker the cell, the larger Wab', Wab = is represented 
by a fixed intermediate shade of gray. From Figure 8(a), it can be seen that most 
of the association is of a monotone nature: the higher the parents' socioeconomic 
status, the better the mental health status of their children. We also investigated 
the component correlations and found two components to be significant at the 5% 
level: pn = .13 (p = .000) and pig == .08 (p = .026). In Figures 8(b) and 8(c) 
we represented the VF^^ and VF^^ using grayscales as above. From Figure 8(b), 
we see that pn indicates linearity again. However, in Figure 8(c) we see evidence 
of some nonlinearity in the data, namely an apparent reversal of the association 
if only the middle categories 'Mild Symptom Formation' and 'Moderate Symptom 
Formation' are considered. Hence, it appears that the association which is present 
in the data cannot be fully explained by linearity. 
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(a) Time series of stock price index 
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(b) Scatterplot of (Zt,Wt) showing association 
between successive jumps in Figure 9(a). p* = 
.09 




(c) Weights Wi for data in Figure 9(b) 



Fig. 9. Monthly Norwegian stock price indices, 1914-2001 
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6.3. Norwegian stock exchange The Norwegian stock exchange data represented 
in Figure 9(a) yield an example with an especially interesting form of association. 
In Figure 9(a) the original time series data Yt are plotted. In Figure 9(b) we plotted 

{Zt.Wt) = (arcsinh(rt - li^i), arcsinh(rt+i - Yt)) 

The arcsinh transformation was done to make the marginal distributions less heavy 
tailed. We found a highly significant association with p*{Z,W) = .09 {p = .000). 
From Figure 9(b) we can already interpret the association: large jumps (up or down) 
in stock prices tend to be followed by large jumps, and small jumps by small jumps, 
indicating periods of volatility. In Figure 9(c), the weights Wi are represented by 
the size and color of the dots (see Section 6.1 for further explanation). This plot 
points to the same conclusion that large jumps are followed by large jumps and 
small jumps by small jumps. Note that in this case, not only the positive weights 
(the black dots) but also the negative weights (the white dots) are highly indicative 
of association. The plot indicates that the up-arm (with the black dots) is 'heavier' 
than the down-arm (with the white dots), that is, there is evidence that in the data 
generating process a jump tends to be of the same sign as the previous jump. 

Seven component correlations were found to be significant at the 5% level after 
applying the Bonferroni correction: pn = .25, pu = .13, P22 — -64, P24 = .19, 
p3S = .19, P44 = .38 and pee = -25, all with p — .000. In Figure 10, the weights 
corresponding to the component correlations are represented. Figures 10(c), (d), (f) 
and (g) point to the cross-like nature of the data. Figures 10(a) and (c) indicate that 
the up-arm is heavier than the down-arm. We did not find a meaningful explanation 
for Figure 10(b). 

6.4. Discussion If a researcher investigating the association between two vari- 
ables decides on the use of p*, we recommend the following approach. First a test of 
the significance of p* should be done, and, if found to be significant, the weights Wi 
should be visualized as described above in order to determine the nature of the asso- 
ciation. If this does not yield the desired insight, it can be worthwhile to investigate 

(k I) 

the component correlations and visualize the corresponding weights W^; ' . These 
components form a (unique) orthogonal decomposition of the 'infinite-dimensional' 
object p* into 'one-dimensional' objects pki, and the orthogonality ensures, in a 
limited sense, that the different components measure different things; by the latter 
we mean that for large samples and close to independence, the sample component 
correlations are approximately independent. The question may arise: why not in- 
vestigate correlations between other sets of orthogonal functions? A sketch of an 
answer is as follows. Because of the various optimality properties of the eigenfunc- 
tions in describing the marginal kernels, these component correlations are likely to 
be a better choice for investigating the deviation of p* from zero than correlations 
between arbitrarily chosen functions for the marginal distributions. One way to 
make this intuitive is as follows: in those regions where the marginal distributions 
are sparse, the eigenfunctions vary relatively slowly (in second derivative sense, see 
remark after Lemma 7), and therefore power of a test based on a component cor- 
relation will be concentrated in those regions where there are many observations. 



44 



Wicher Bergsma 







•.v3e>< ■••" 


jj 








•• .1 : ^ 




; ■; * .* .' 


■ 




m 


^^ 




^WhI^h 



(a) pii = .25 



(b) P12 = .13 



(c) P22 = -64 







^^ 




M-<M 








:;••■■-- 




Wi-l-^:: 




J0: 

1'* ' 




•■■■■'. ■ 



H 


r 




K??'-' 










y^- 


















|:.*^■«r- 










■■■y.i-- 




- 




#; 




.* 






^:* 




i.'i": 








^ 


.« 


.ft: 


n 








:m 



(d) P24 = .19 



(e) P33 = .19 



(f) P44 = .38 
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(g) P66 = .25 



Fig. 10. Representation of weights contributing to pij for data in Figure 9(b) 
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Thus, if we believe p* to be a good measure of deviation from independence, then 
good 'one-dimensional' objects to look at are the correlations between the marginal 
eigenfunctions of hp^ and hp^ . 
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